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Abstract —Kernel matrices (e.g. Gram or similarity matrices) 
are essential for many state-of-the-art approaches to classi¬ 
fication, clustering, and dimensionality reduction. For large 
datasets, the cost of forming and factoring such kernel matrices 
becomes intractable. To address this challenge, we introduce a 
new adaptive sampling algorithm called Accelerated Sequential 
Incoherence Selection (oASIS) that samples columns without 
explicitly computing the entire kernel matrix. We provide con¬ 
ditions under which oASIS is guaranteed to exactly recover 
the kernel matrix with an optimal number of columns selected. 
Numerical experiments on both synthetic and real-world datasets 
demonstrate that oASIS achieves performance comparable to 
state-of-the-art adaptive sampling methods at a fraction of the 
computational cost. The low runtime complexity of oASIS and 
its low memory footprint enable the solution of large problems 
that are simply intractable using other adaptive methods. 

Index Terms —PSD Matrix Approximation, Kernel Machines, 
Column Subset Selection 

I. Introduction 
A. Kernel Matrix Approximation 

M ANY machine learning and data analysis frameworks 
for classification, clustering, and dimensionality reduc¬ 
tion require the formation of kernel matrices that contain the 
pairwise “similarities” or distances between signals in a col¬ 
lection of data. For instance, kernel methods for classification 
Q, nonlinear dimensionality reduction and spectral 

clustering 0 all require computing and storing annxn kernel 
matrix, where n is the number of examples (points) in the 
dataset. 

As the size n of the dataset grows, the computation and 
storage of kernel matrices becomes increasingly difficult. For 
instance, explicitly storing a kernel matrix with dimension 
n = 10 5 using IEEE standard binary64 requires 80 gigabytes 
of memory [5). To extend kernel-based methods to extremely 
large n, a host of methods have focused on sampling a small 
subset of columns from the kernel matrix and then computing 
a low-rank approximation of the matrix using some flavor of 
the Nystrom method (6j. 
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An accurate low-rank approximation determines and keeps 
only the most important dimensions in the column space of the 
matrix. In doing so, it captures the majority of the information 
in a matrix with a smaller ambient dimensional space. In the 
kernel matrix setting, the size of the dataset is larger than 
its dimensionality. This results in a kernel matrix with low¬ 
dimensional structure lying in a large ambient space. A low- 
rank approximation, then, can capture all of the information 
in the kernel matrix without requiring n 2 entries. 

Nystrom methods are one example of a general approach 
to computing low-rank matrix approximation from a subset of 
rows and/or columns of the matrix 0. Choosing a relevant 
subset is broadly referred to as column subset selection (CSS). 
CSS methods have been applied successfully in applications 
ranging from image segmentation |8j to genomic analysis [ 9] 
and matrix factorization m- 

The success of CSS-based approaches for matrix approxi¬ 
mation depends strongly on the columns chosen to approxi¬ 
mate the range space of the matrix. Intuitively, uniform random 
sampling of the columns will provide an accurate approxima¬ 
tion when the columns of the kernel matrix are independently 
and identically distributed. However, when the underlying 
data are non-uniformly distributed or even clustered, uniform 
sampling requires extra column draws to ensure an accurate 
approximation. In these settings, it has been shown in both 
theory 03 and practice 03 that adaptive sampling methods 
provide accurate approximations of low-rank kernel matrices 
with far fewer samples than uniform random sampling. In this 
way, we say that adaptive methods are more efficient than 
uniform random sampling. 

Current adaptive sampling methods take advantage of the 
structure of the kernel matrix. Random adaptive sampling 
methods use the entries of the kernel matrix to compute 
a weighted probability distribution over the column indices. 
The distribution improves the chances of sampling relevant 
columns. Deterministic adaptive sampling methods use the 
already-sampled columns to compute a residual over the kernel 
matrix, from which a new column index is selected. 

The downside of current adaptive methods is their compu¬ 
tational burden. Adaptive methods do not scale well to large 
problem sizes for two reasons. First, existing adaptive methods 
inspect the entire kernel matrix to determine which columns to 
select. For extremely large datasets, both forming and storing 
an explicit representation of the kernel matrix is intractable. 
Second, existing adaptive methods require dense nxn matrix 
computations, even for sparse matrices that are otherwise easy 
to store because they have relatively few non-zeros elements. 
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For these reasons, current adaptive methods cannot be applied 
to extremely large collections of data 1131. 


B. Contributions 


For adaptive sampling methods, it is not generally possible 
to determine the best columns to select from a kernel matrix 
without explicitly forming all of the candidate columns. How¬ 
ever, as the kernel matrix is symmetric, a small sample of 
columns provides partial information about the remaining un¬ 
sampled columns. Based upon this observation, we introduce 
a principled adaptive sampling scheme called Accelerated Se¬ 
quential Incoherence Selection (oASIS) that predicts the most 
informative columns to sample from a kernel matrix without 
forming the entire matrix. oASIS has several advantages over 
existing adaptive sampling schemes. 


• oASIS does not require a fully precomputed kernel 
matrix. It can operate solely on the data, using the 
kernel function. This is because oASIS selects the column 
to be included in the approximation before explicitly 
computing it. For this reason, only the submatrix of 
sampled columns must be computed/stored. 

• oASIS’s total runtime scales linearly with the matrix di¬ 
mension n, making it practical for large datasets. For this 
reason, oASIS is orders of magnitude faster than other 
adaptive methods that have 0{n 2 ) or higher runtime. 

• oASIS can exactly recover the rank r kernel matrix in r 
steps. 

• oASIS preserves zero entries in sampled columns of 
sparse kernel matrices, enabling greater efficiency in 
these cases. This is in contrast to conventional greedy 
methods that require dense n x n matrix computations 
that “fill in” the zero entries in a matrix [ 12J. 


oASIS provides a tractable solution for approximating ex¬ 
tremely large kernel matrices where other adaptive methods 
simply cannot be applied (H), 04), 03- In a range of 
numerical experiments below, we demonstrate that oASIS 
provides accuracy comparable to the best existing adaptive 
methods fl2) , |14|-|T6) with dramatically reduced complexity 
and memory requirements. 

While oASIS is useful for kernel matrices, its usefulness 
becomes more pronounced when the dataset is so large that 
it can no longer be held entirely in memory. This is because 
we can parallelize oASIS by splitting up the dataset and the 
working matrices among various processors. We introduce 
an algorithm called oASIS-P that efficiently distributes the 
data and the submatrices used in approximation, as well as 
the computation and selection of a new column to add to 
the approximation. oASIS-P is highly scalable as it incurs 
minimum communication overhead in between the parallel 
computing nodes. We implemented oASIS-P using a standard 
message passing interface (MPI) ED- With oASIS-P, we can 
perform the Nystrom approximation in a data size regime 
where even the simplest algorithms become difficult to run. 

In addition, we study oASIS from a theoretical perspective 
and propose conditions under which oASIS will exactly re¬ 
cover a rank-?’ kernel matrix in r steps. Other greedy methods 
do not have this guarantee. oASIS can perform this recovery 


because it chooses linearly independent columns at each step, 
which enables efficient sampling. Random selection methods 
do not necessarily choose independent columns, and can 
choose redundant columns, resulting in inefficient sampling. 

This paper is organized as follows. In Section [II] we intro¬ 
duce the Nystrom method, survey existing sampling methods, 
and describe important applications of kernel matrices in 
classification, clustering, and dimensionality reduction. In Sec¬ 
tion m we describe the motivation behind our initial column 
sampling method, called Sequential Incoherence Selection or 
SIS. We then describe the accelerated version of SIS, or 
oASIS. We then describe a parallel version of oASIS, which 
we call oASIS-P. In Section [TV| we provide theory determining 
the conditions under which oASIS will exactly recover the 
kernel matrix. And in Section [V] we use multiple synthetic 
and real datasets to demonstrate the efficacy of oASIS for 
approximating kernel matrices and diffusion distances for 
nonlinear dimensionality reduction |2j. 

II. Background 

To set the stage, we will quickly describe a few common 
kernel and distance matrices used in machine learning. Fol¬ 
lowing this, we introduce the Nystrom method and describe 
its variants and applications. 

A. Notation 

We write matrices G and vectors x in upper and lowercase 
script, respectively. We use A ' to denote the Moore-Penrose 
pseudo-inverse of A. We represent the element-wise product of 
matrices A and B as AoB. colsum(A) denotes a row vector, 
where the i th entry contains the sum of the i th column of A. 
When describing algorithms, we use “Matlab” style indexing 
in which G(i,j) denotes the (z, j) entry of the matrix G and 
G(:,j) denotes its j th column. We use A to denote a collection 
of chosen indices of the columns of a matrix A; A c is the set 
of indices not chosen. For example, /1 a are all of the columns 
of A indexed by the set A. 


B. Kernel Matrices and their Applications 

Kernel methods are widely used in classification and clus¬ 
tering to “lift” datasets into a high-dimensional space where 
the classes of data are linearly separable. This is done with 
the help of a kernel function which measures pairwise 

similarities between points in the lifted space. An n x n 
kernel matrix is then formed from n data points {^?}" =1 
with G(i,j) = k(zi,Zj ), where high magnitude entries of 
G correspond to pairs of similar data points. The singular 
vectors of G are then computed and used to map the data back 
into a low-dimensional space where the data is still linearly 
separable. The kernel trick is widely used in classification and 
clustering Q, g), §, (l8)-|[22) 

Manifold learning methods, including diffusion maps (2) 
and Laplacian eigenmaps 1231, map high-dimensional data that 
lie on nonlinear but low-dimensional manifolds onto linear 
low-dimensional spaces. These methods use a kernel matrix 
that encodes the geodesic distance between pairs of points — 
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the shortest path between two points along the surface of the 
underlying manifold. High-dimensional data are mapped into a 
low-dimensional space using the left singular vectors of G, and 
thus a singular value decomposition (SVD) of G is required. 
For a review of dimensionality reduction using geodesic and 
diffusion distance kernel matrices, see (U, (24). 


C. The Nystrom Method 

Williams and Seeger |JJ first presented the Nystrom method 
to improve the speed of kernel-based methods for classifica¬ 
tion. The method approximates a low-rank symmetric positive 
semidefinite (PSD) matrix using a subset of its columns. 

Consider an n x n PSD matrix G of rank r. For all PSD 
matrices G, there exists a matrix X £ R r x " such that G = 
X T X. Suppose we choose a set A of k < r indices and 
then sample those A columns from G as Ct, Ct £ R. nxk . 
We collect the k indices into a set A. The sampling forms a 
partition of X = [Aa A r .\^]. We can then express G as 


G = 


r y t i 

A A 
vT 
A Ac 


[A'a 


X A =] 


w k x£x A c 

xlx A xlx A c 


(1) 


where C k = \W k X^X^ consists of the nxk sampled 
columns of G, and W k = Xj^X A is the k x k symmetric 
matrix containing the row entries of the sampled columns at 
the indices of the sampled columns. Note that without loss of 
generality we can permute the rows and columns of G so that 
the columns in C k are the first k columns of G. The Nystrom 
approximation of G is defined as 


G « G k = CkWlCl. (2) 


Note that neither X nor any partition of X is found explicitly, 
but that the G k is found through the set of sampled columns 
Cfc and the respective rows W k . 

An approximate SVD of G can be obtained from the SVD 
of W k , which is written as W k = Uw^wUyv- The singular 
values £ of the approximation G k = LTEU T are given by 
{n/k)Y>w (25) , and the singular vectors are given by 

U = \j~^C k Uw^w ■ 

Since W k is fc x fc, this computation is much faster than 
computing the full n x n SVD of G. The complexity of the 
SVD step reduces from 0(n 3 ) to 0(k 3 ) with k < r <C n. 

Note that the Nystrom method enables the singular vectors 
of G, and thus a low dimensional embedding of the data, to 
be computed from only a subset of its columns. When n is 
large, it is desirable to form only a subset of the columns of G 
rather than calculate and store all n(n — l)/2 pairwise kernel 
distances. 


D. Column Sampling Methods 

We now describe the four main categories of column 
selection methods. We compare oASIS with the methods listed 
below, as together they cover all of the types of sampling used 
currently in Nystrom approximation. 


1) Uniform Random Sampling: Early work on the Nystrom 
method focused on random column sampling methods (Tj. 
Theoretical bounds on the accuracy of a rank-fc approximation 
to G after sampling a sufficient number of columns have been 
delveloped for various norms 0- 

Uniform random sampling is an appealing method as it is 
computationally fast. However, the accuracy of a matrix ap¬ 
proximation depends directly on the specific columns sampled. 
This method does not take advantage of the structure inherent 
in the kernel matrix, leading to redundant sampling and larger 
approximation error. Improvements on this sampling method 
can be made by weighting the probability distribution used for 
the column draw, to increase the chance of selecting relevant 
columns. 


2) Non-deterministic Adaptive Sampling: Leverage scores 
are a recent method for computing the distribution over the 
column draw [15]. Given the rank-fc SVD of G k = U k T, k U£, 
the scores are computed as Sj = \\U k (j, :)|| 2 , and each 
column is selected with probability proportional to its score. 
This method provides accurate approximations by sampling 
more relevant columns. However, Leverage scores require 
the low-rank approximate SVD of G to be precomputed at 
expensive 0(n 3 ) cost. There are fast approximations available 
for finding the first few singular vectors and values of G (26) . 
Regardless, G must be completely formed and stored before 
it is approximated. 


3) Deterministic Adaptive Sampling: Early deterministic 
adaptive methods G3 use an exhaustive search to find 
columns that minimize the Frobenius norm error ||G — G k \\p- 
While accurate, this method also requires a precomputed G, 
and has combinatorial complexity. A more efficient adaptive 
scheme by Farahat [ 121 builds a Nystrom approximation by 
selecting columns sequentially using a matrix of “residuals.” 
At each stage of the method, the column with the largest 
residual is selected, and the residual matrix is updated to reflect 
its contribution. While accurate, the method also requires a 
precomputed G, and the cost of updating the residual matrix 
is 0{n 2 ) per iteration. 


The residual criterion is related to the adaptive probability 
distribution calculated by Deshpande 0- After a sufficient 
number of columns are chosen, an orthogonalization step 
obtains a rank-fc approximation of G k . 


4) K-means Nystrom: Instead of approximating G with 
direct column sampling, an approximation can be made from 
representative data points found with a A'-means algorithm. 
A dataset consisting of clouds of points clustering around 
K centroids can be described by finding the locations of the 
centroids. Each datapoint is then remapped into the eigenspace 
of the centroids. This method was first described by Zhang 
(T§. Since the computed centroids do not exist in the dataset, 
the method does not directly sample columns, but remaps 
them onto a rank-A’ space. Once the solution to the I\- 
means is found, the remapping is 0(£n). While finding an 
exact solution to A'-means is NP-hard, generally AT-means 
will converge in 0(n 2 ) time. The resulting G can not be 
formed from the columns of G, and so has no space saving 
representation. 
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In a survey of methods by Kumar [251, K -means was found 
to be the state-of-the-art approximation method compared 
to previous sampling methods such as Incomplete Cholesky 
Decomposition (27), Sparse Matrix Greedy Approximation 
03- and Kumar’s Adaptive Partial method derived from 
Deshpande’s Adaptive Sampling method ED- For this reason, 
in lieu of comparisons with many different adaptive sampling 
techniques we can compare our results directly with A'-means. 
For our experiments, we used the same code as provided in 
03- with parameters used in both |16j] and 


E. Finding Low-Dimensional Structure 

In addition to using CSS for low-rank kernel approximation, 
column selection approaches have also been used to find 
important data points and in turn, reveal low-dimensional 
structure in data 


1291. Recently, it was shown in [ |30| 
that oASIS can be used to select representative examples from 
large datasets in order to enable clustering and dimensionality 
reduction. This method, called Sparse Self-Expressive Decom¬ 
position (SEED), consists of two main steps. First, oASIS is 
used to select data points for a dictionary. Second, all of the 
data is then represented by a sparse combination of the points 
in the dictionary using Orthogonal Matching Pursuit mo, ( 32 ). 
The sparsity patterns found in the representations can be used 
for for clustering, denoising, or classification. SEED’S ability 
to properly describe data hinges on the selection of data points 
used for the dictionary. oASIS is able to efficiently sample 
good points to use for this task, compared to other adaptive 
sampling methods. A full treatment of SEED is can be found 
in m 


III. Accelerated Sequential Incoherence 
Selection (oASIS) 

In this section, we introduce oASIS, a new adaptive sam¬ 
pling method for column subset selection. We also introduce 
a parallel version called oASIS-P, and we analyze the com¬ 
plexity of both. 


A. Sequential Incoherence Selection (SIS) 

We now address the question of how to build a Nystrom 
approximation by sequentially choosing columns from G. 
Suppose we have already selected k columns from G and 
computed a Nystrom approximation Gfc ■ Our goal is to select 
a new column that improves the approximation. If a candidate 
column lies in the span of the columns that we have already 
sampled, then adding this column has no effect on ||G — Gfc||. 
Ideally, we would like to quantify how much each candidate 
column will change the existing approximation and select the 
column that will produce the most significant impact. Since an 
ideal column does not lie in the span of the columns that have 
been selected, we say that this column should be incoherent 
with those already selected. 

We can develop a criteria for finding this new, incoherent 
column of G as follows. Consider a PSD matrix G. Recall 
that any such G can be written as X T X, where X con¬ 
tains n points in r dimensions. Given an index set A of A: 


columns we can form a partition X = \X,\ A A c] , and 
can collect the k selected columns from G into a matrix 
Gfc = Ga = [ Wk Xf, A a] . To improve the approximation 
Gfc, we select the best new column from G, append it to Gfc, 
and compute a new Gk+i . The best new column index i £ A c 
in G directly corresponds to the index of the column Xi that 
lies farthest from the span of A a . This column a;, satisfies 

argmax||(/-P A )xj|| 2 , (3) 

i6A° 

where I is the identity matrix and P A = A a A^ is an 
orthogonal projection onto the span of the k columns in A 
that have already been selected. Provided that the columns in 
X : \ are linearly independent, we can expand 0 as 

argmaxxfxj - xf A a (AJA a ) _ 1 A A Xi- (4) 

ie A c 

Even though A is not known explicitly, |4]i can be evaluated 
based upon knowledge of Ck and diag(G). The first term of 
the expression in (|4| is the diagonal entry i of G, which we 
denote as di. The second term can be written as bfW k bi, 
where bf = xf A a is one of the n — k rows of Ck indexed 
by A c , and Wk is comprised of the k rows of Ck indexed by 
A. When A a contains linearly dependent columns, we replace 
W k x with W f !. Therefore, we can iteratively select columns to 
sample for the approximation without precomputing the entire 
kernel matrix G, as shown in Figure [T] This sets oASIS apart 
from all other adaptive methods, as discussed in Sections [II-D2| 
and UI-D4l 

With the evaluation of our criteria now possible, we develop 
the following sampling method for sequential incoherent se¬ 
lection (SIS). We assume that the process has been seeded 
with a small set A of /c 0 column indices chosen at random. 
Columns are then selected to be included in the approximation 
as follows: 

1) Let k = |A|. Collect the columns of G indexed by A as 
Gfc. Form W k 1 from the rows of Gfc indexed by A. 

2) Let bf denote row * of Gfc, and let d, denote element i of 
diag(G). For each unselected column i £ A c , calculate 

Ai = di - bfW^b,. 

3) Select the column that maximizes || and set 

A <-Au{i}. 

4) If the selected value of |Aj| is smaller than a user set 
threshold, then terminate. Otherwise, return to Step 1. 


B. oASIS 


A naive implementation of SIS in Section III-A is ineffi¬ 
cient, because each step requires a matrix inversion to form 
W k 1 in addition to calculating the errors A,. Fortunately, both 
of these calculations can be performed efficiently by updating 
the results from the previous step using block matrix inversion 
formulas. We dub this new method oASIS. 

We first consider the calculation of W k \ after a new 
column is added to the approximation made from k columns. 
We assume throughout the rest of this section that Wk+i 


is invertible and thus Wl , IT', 


fc+i 


k+l' 


We show that our 
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XlX A 


xfX A 

u 


w 



bj 


i 


C k = Ga diag(G) 


Fig. 1. All of the terms in the criterion in {5} for the next column index 
to select can be found in the structure of the columns C k = Ga that have 
already been sampled, and the diagonal of G. This allows for a deterministic 
sampling without precomputing all of G. See Section [Til- A | for details. 


column selection rule guarantees the invertibility of W k in 
Section IIV-AI Let b denote the first k entries of the new 
column, d denote the relevant element of diag(G), and 
Afc + i = d — b T W k 1 b. Using a block inversion formula, we 
obtain 


= 


w k 

b 

-1 

W k 1 + sqq T -sq 

b T 

d 


T 

—sq s 


(5) 


where s = (d — b T W k 1 b)~ 1 = A^L is the (scalar valued) 
Schur complement and q = W k 1 b is a column vector. This 
update formula enables W k \ to be formed by updating W k 1 
and only requires inexpensive vector-vector multiplication. 
Note that W k +\ is invertible as long as A /. :+1 is non-zero, 
which is guaranteed since the algorithm terminates if A^+i = 
0 , in which case our approximation is exact. 

Now consider the calculation of A i = di — bJWlbi for all 
candidate columns i. Note that Cl = [hi, 62 , • • • ,b n ]. We can 
evaluate all bjw\bi simultaneously by computing the entry- 
wise product of C k with the matrix Rk := IV)) 1 C[ and then 
summing the resulting columns. If we have already formed Ck 
and Rk, then the matrix Rk+i = W k \ Cj +1 needed on the 
next iteration is obtained by applying © to Cfe+i to obtain 


Rk + 1 = w^cl +1 = Wfc-i 


cl 

L c fe+U 


( 6 ) 


R k + sq(q r Cl-cl +1 ) 
s(~d T Cl + cl +1 ) 

Equation © forms Rk+i by updating the matrix Rk from 
the previous iteration. The update requires only matrix-vector 
and vector-vector products. The application of this fast update 
rule to the method described in Section |III-A yields oASIS, 
detailed in Figure [2] 

oASIS can be initialized with a small random subset of 
ko starting columns from G. Next, the starting matrices Ck, 
IV) 1 and Rk = IV)) 1 Cj are formed. On each iteration of the 


Algorithm 1: oASIS 

Inputs: Symmetric matrix G € R nxn , 

Diagonal elements of G, stored in d. 

Maximum number of sampled columns, £, 

Initial number of sampled columns, k < I, 
Non-negative stopping tolerance, e. 

Outputs: The sampled columns C, 

The inverse of the sampled rows W~ x . 
Initialize: Choose a vector A € [1, n] e of k random starting 
indices. 

Ck = G(:,A) 

W^ 1 = G(A, A) -1 

^ = w^cl 

while k < t do 

A = d — colsum(Gt o R k ) 
i = argmax^ A |A(j)| 
if | A(z) | < e then 
return 
end if 
b = G(A,i) 
d = d(i) 
s = 1/A (?) 
q = R(:,i) 

C k+1 = [C k ,G(:,i)} 

Form W k } 1 using © 

Update Rk+i using © 
k <- k + 1 
A<-Au{i} 

end while 

Fig. 2. The oASIS algorithm. Note that G need not be precomputed. 


algorithm, the vector of Schur complements A is formed by 
computing 

A = d — colsum(Cfc o R k ). 

Next, the largest entry in A is found, and its index is used 
to select the next column from G. The update formulas © 
and © are then used to form the matrices W k+1 and R k +i 
required for the next iteration. 


C. Parallel oASIS 

For the case of kernel matrices G generated from a dataset 
{zi}2= i, oASIS does not need to explicitly store G. Therefore, 
oASIS saves time, as computing the full G is expensive. oASIS 
also saves space, as the full G increases quadratically with 
n. However, the dataset may be itself very difficult to store 
due to size. In addition, oASIS requires space to build C, 
W~ x , and R. In total, oASIS requires 0(mn + I 2 + 2£n) 
of memory. In cases where dataset is too large to fit in 
memory, the matrix operations for oASIS can be distributed 
among p separate processor nodes. We begin by arranging 
the dataset columnwise into a matrix Z. Each node stores a 
submatrix consisting of n/p columns of Z, a copy of 
W~ x , and the column entries of C k and R k corresponding to 
the results of the kernel function over the entries in Z^ with 
the entries in Z \. When a new column index i is selected. 
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%) 


n/p 


%k -fl 


T 

m 


n/p 


C(i) 


W- 


t~ k ~A\ 


t~ k -X 


A (i ) 


7K 


n/p 




Fig. 3. Diagram of a single node in oASIS-P. Each node retains a subset Z^\ 
of the data. The node receives the selected data point z k + 1 , and updates 
and W- 1 using the kernel function g(z{,Zj) as in Fig. I ll It then computes 
and broadcasts A ^ to determine the next point to sample from the entire 
dataset. See Section III I -Cl lor details. 


the node storing column vector z i: is found and the column 
is broadcast to all of the nodes. Each node can then calculate 
the appropriate new entries as needed, as shown in Fig. [3] For 
each new column, the size of the communicated vector is the 
dimensionality of the data point, which is much smaller than 
the kernel matrix. Fow communication overhead is an essential 
property of oASIS-P since, in distributed settings, the cost of 
internode communication can be larger than intranode compu¬ 
tation. The memory requirements for over each node becomes 
0(mn/p+£ 2 + 2£n/p+£m), which makes performing oASIS 
over datasets with millions of points tractable. 

We call this method Parallel oASIS, or oASIS-P, and it 
is detailed in Figure [4] The implementation makes use of 
the standard MPI commands Broadcast(data) (to send data 
from one node to every node) and Gather(variable ) (to 
concatenate variables in each node into a single variable on 
one central node). 


IV. Properties and Applications of oASIS 


A. Theoretical Guarantees of oASIS 

For general PSD matrices G of rank r, we can guarantee 
that oASIS will finish in r steps. We develop the theory needed 
to prove this guarantee in Sections |IV-A1 and IV-A2 When 
G is a Gram matrix, formed as G(i,j ) = Zi T Zj, we can use 
the sample index set A found with oASIS to make additional 
guarantees on approximating the dataset itself. We mention the 
guarantees briefly in Section |IV-A3| and they are described 
fully in |j30j. 

In Section |IV-A1| we show that oASIS will select linearly 
independent columns of G at each step. This becomes very 
useful in practice, as G is computed from the W', where IF = 
Xj^X\. If the selected columns of G are not independent, 
then IF is a singular matrix. By selecting linearly independent 
columns of G, oASIS can guarantee that W' = IF -1 , 
enabling time and space saving calculation of this element 
when computing G. We discuss this further in Section IV-A4 


Algorithm 2: oASIS-P 

Inputs: Dataset arranged as a matrix Z £ E r " x " 

Kernel function g(Zi,Zj), 

Number of nodes p indexed by (*), 

Maximum number of sampled columns, £, 
Initial number of sampled columns, k < £, 
Number n of columns in Z, 

Non-negative stopping tolerance, e. 

Outputs: The sampled columns C , 

The inverse of the sampled rows IF -1 . 
Initialize: Choose a vector A £ [l,n] £ of k random 
starting indices. 

Foad separate n/p column blocks of Z into 
each node as Z^y 
Broadcast(Z(y A)) as Z A . 

On each node (i) : 

d(i)(j) = g(Z {i) (:,j),Z {i) {yj)) over all j in Z (i) 

= g(Z^(:,i),Z A (:,j)) over all i in Z (i) 

and all j in Z A 

W(i,j) = g(Z A (:,i),Z A (:,j)) over all i in Z A 

and all j in Z A 
Compute IF -1 

Rw = w-'c^ 

while k < £ do 
On each node (*) : 

A (i) = d(i) - colsum(C , (i ) o i? w ) 

A = Gather(Ay')) 
i = arg max )YA |A(j)| 

if |A| < e then 
return 
end if 

Zk+i = Broadcast(Z(:,i)) 

A = Broadcast^ A) 

On each node (i) : 

C(i)fc +i = g(Z {i) (:,i),z k+ 1 ) over all i in Z {i) 

[C(i)X(i)k+ 1 ] 

q = g(Z A {:, i),z k+1 ) over all j in Z A 
s = 1/A 

Update IF -1 using © 

Update using © 

Z A = [Z A , Zk+i] 
k <— k + 1 
A<-Au{i} 

end while 

IF" 1 = W/- 1 
C = Gather (C/,)) 

Fig. 4. oASIS-P, for datasets too large for a single node. 


1) Independent Selection Property of oASIS: Given a PSD 
matrix G = X T X, rank(G) = rank(A) = r. In Lemma [T] 
below, we provide a sufficient condition that describes when 
oASIS will return a set of r linearly independent columns. 
This is a similar condition as that provided in |30|, although 
we provide an alternate proof. 

Lemma 1. At each step of A lt>. [ 2 ] the i th column of the matrix 
G is linearly independent from the previously selected columns 
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provided that A (i) > 0. 

Proof: We prove this by construction of A\. Consider 
adding a new column Xi to A a with nonzero A (i) = 
||(/ — P A )Xi\\l and i £ A c . Then Xi is linearly independent 
of each column in A a , and so Gk+\ = A T x,; is linearly 
independent from any other Gj = X T x :l . Therefore as long 
as A (i) > 0 at each step, the column selected will be linearly 
independent from the previous columns selected. ■ 

Remark. This result guarantees that oASIS will return a set 
of r linearly independent columns in r steps as long as the 
selection criterion A (i) f 0 holds before exact reconstruction 
occurs. While the algorithm may terminate early if A(i) = 0 
before r columns have been selected, we have not observed 
this early termination in practice. 

2 ) Exact Matrix Recovery: We now prove that when oASIS 
selects r columns from G, then G = G. 


Theorem 1. If oASIS chooses r columns from a PSD matrix 
G with rank(G) = r, the Nystrom G = G. 

Proof: As X is rank r, and oASIS has chosen r linearly 
independent columns, then at the next step all A (i) = 0 as 
\\{I ~ P\)Xi\\l = 0Vi. Therefore || (/ - A A A A t )x i ||| = 0, 
or || A - X A X A lX\\ F = 0. X = [X A X A ,], and there¬ 
fore ||X A c-X A X A tA A e|| F = 0, or A' A c = A A A A tA A c. 
Expanding G in terms of A a and A a =, 


G = CW ] C T = 


'aJaV 

AJ c A a _ 

AJA a 


[A a A a ] f [AJA A 


aTa a = 


aJa A c 


A A cA a aJ c A a ( aJ a a ) t A J A A c 


We examine the lower right block of this expansion as the 
others exactly match that of G in 0 - 
By Lemma [I] A a is full rank, and so 

A A c A a (A a A a )^ A a A A c = AJ c A a (AJA a )“ 

= aJ c (a a a^)a a = 

= aJcAac 


' _ 1 A t 


Aac 


Thus the expansion of G is equal to the expansion of G in 

0 - ■ 

3) Guarantees for oASIS when G is a Gram Matrix: When 

G is a Gram matrix, we can arrange the points in the dataset 
{zi}i = i 1 Zi £ R m columnwise into a matrix Z. Then we can 
write G = Z T Z, with Z of rank in. oASIS selects a set |A| = 
to such that Z = Pa (Z) exactly. This property is useful in 
solving a more general CSS problem than Nystrom, precisely 
formulated as 

min \\Z — P A Z\\p, (7) 

|A|=L 

where Z is an m x n matrix. This problem has combinatorial 
complexity, and many of the selection schemes for Nystrom 
arise from attempting to solve this more general problem. 
Indeed, the adaptive selection methods in |6j, |l 1 1 , and others, 
can also apply to this problem. We have developed guarantees 
on oASIS’s ability to exactly recover Z so that we can use the 
columns in Z A in developing a self-expressive decomposition 
of Z. Although a full treatment of SEED is described in 


we briefly describe this extension in Section II-E 


4) Comparison with Other Theory: oASIS can guarantee 
exact matrix recovery in an information theoretically efficient 
number of steps. We show a synthetic example in Figure [5] 
Using a dataset consisting of points drawn from a 2 D Gaussian 
distribution centered on (0,0) and points drawn from a 3 D 
Gaussian distribution centered on (0,0,1), we compute a 
PSD Gram matrix G with a resulting rank(G) = 3. oASIS 
selects columns linearly independent from previously selected 
columns at each step, increasing the rank of the approximation 
each time. This enables oASIS to use an iterative update of 
W~ x instead of computing W' after all columns have been 
selected. At 3 steps, oASIS terminates, with G = G within 
machine tolerance. 

Random or adaptive random sampling techniques have 
theoretical guarantees that G will be close to a rank-A: ap¬ 
proximation of G after a certain number of iterations @.G3- 
However, the lack of guarantees on column selection make for 
redundant sampling. As an illustration, we include separate 
trials of uniform random sampling in Figure [5] Uniform 
random sampling frequently selects columns within the span 
of previously selected columns at each step, and as a result 
the approximation error is generally higher than oASIS. In 
cases of very large data, this becomes a practical concern in 
computing both G and W'. 

B. Complexity of oASIS 

The rate-limiting step of oASIS in Fig. [2]is the computation 
of Rk+i by updating R^. Equation ([6]) enables this to be 
performed by sweeping over the entries of Rj~, which has 
dimension k x n. The complexity of a single iteration is thus 
0(kn). If I columns are sampled in total, then Yhk =l ^ n = 
\i(t + 1 )n entries must be updated. The resulting complexity 
of the entire oASIS algorithm is thus 0(l 2 n). In practice, the 
number of sampled columns £ is much less than n. This makes 
oASIS considerably more efficient than adaptive methods such 
as Farahat’s & which requires the computation of n x n 
residual matrices at each stage resulting in 0(£n 2 ) complexity. 
oASIS is also more efficient than Leverage scores [15), since 
the scores use an approximate SVD of G that requires 0(n 2 ) 
computations over dense matrices. oASIS is about as efficient 
as I\ -means Nystrom, with complexity O(In). However, K- 
means does not select columns, and instead forms the full G 
from the low-dimensional remapping. As a result, while K- 
means is useful in Nystrom approximation, it may not be as 
useful for more general CSS methods. oASIS, in contrast, can 
be used in more general CSS problems via the Gram matrix. 

If we only compare the speed in finding A, oASIS is much 
slower than uniform random sampling, with its 0(1) sampling 
speed. But oASIS also computes G and W~ x along the way, 
while these still need to be computed after selecting the 
columns to be used under uniform random sampling. In large 
data regimes, these become practical considerations that make 
implementation of uniform random sampling less efficient, as 
we discuss in Section HV-CI 

C. Complexity of oASIS-P 

The low complexity of oASIS makes it practical for ex¬ 
tremely large matrices where other adaptive sampling schemes 
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No. Columns selected No. Columns selected 

(b) (c) 


Fig. 5. For a PSD Gram matrix G formed from the dataset in (a), we compare approximation errors for G formed using oASIS vs. 5 separate trials of 
uniform random sampling in (b). We terminate trials at exact recovery. oASIS guarantees exact recovery after sampling 3 columns. Random sampling chooses 
redundant columns, and this inefficient sampling results in less accurate approximations. For uniform random sampling many of the error curves lie directly 
on top of each other, as the method repeatedly chooses redundant columns from the bottom cluster of data. The variety of columns sampled can be illustrated 
by plotting the rank of G by number of sampled columns in (c). oASIS chooses columns that increases the rank of G at each step. In contrast, choosing 
redundant columns results in a rank-deficient approximation. See Section |lV- A4| for details. 


are intractable. For oASIS-P, the computational complexity of 
oASIS is divided by the p nodes, such that each node has 
0(£ 2 n/p) complexity. At first blush, oASIS-P is still slower 
than uniform random sampling and its 0(1) sampling speed. 
However, in regimes where the dataset cannot be loaded en¬ 
tirely in memory, three practical considerations make uniform 
random sampling less competitive. First, forming i columns 
from a dataset with z i e takes at least 0(£n) 

time. For many applications, forming the columns as they 
are sampled is substantially more expensive than the process 
of adaptively selecting columns. Second, communication of 
data vectors among nodes becomes the bottleneck in the 
parallel implementation, and oASIS-P and random sampling 
appear competitive in terms of column selection/generation 
time, as shown in Table III Third, random sampling may 


For each dataset {z,;}” =1 in all classes, we consider Gaus¬ 
sian kernel matrices where G(i,j ) = exp(||zj — Zj|||/c7 2 ). For 
datasets in the first class we also consider diffusion distance 
matrices M = D~ 1 ^ 2 ND~ 1 G where D is a diagonal matrix 
containing the row sums of M, and N is a Gaussian kernel 
matrix For each dataset, we tune a to provide good matrix 
approximation for any sampling method. 

We compare oASIS to the following state-of-the-art 
Nystrom approximation methods: (i) uniform random sam¬ 
pling, (ii) Leverage scores fT3] (Section II-D2 1 , (iii) A'-means 
Nystrom approximation |16| (Section |II-D4 1 . and (iv) Fara- 


require substantially more columns than oASIS to achieve 
the same accuracy, in which case adaptive sampling is highly 
advantageous. 


hat’s greedy update method |I2) (Section II-D3| >. For methods 
(i), (ii), and (iii), we repeat experiments 10 times and average 
the results. For the second class of matrices we consider 
oASIS, uniform random sampling, and A'-means since the 
other methods become intractable when the matrix becomes 
too large to explicitly store. For the largest class of matrices we 
only consider oASIS and uniform random sampling. Specific 
datasets and experiments are described below. 


V. Numerical Experiments 
A. General Setup 

To evaluate the performance of oASIS, we measure the 
accuracy of Nystrom approximations for three size classes 
of kernel matrices. We first consider matrices where we can 
directly measure the approximation error of the Nystrom 
method. Second, we consider larger problems for which form¬ 
ing the entire kernel matrix is impractical. Third, we consider 
problems so large that the dataset will not fit in memory. 


B. Full Kernel Matrices 

Here, we consider datasets for which the kernel matrices 
can entirely fit in memory, making all the sampling methods 
tractable. Convergence curves are generated by forming Gk 
for increasing k and then calculating the approximation error 
defined by ||Gk — G||f/||G||f- We consider the following 
datasets, run using MATLAB on an iMac with a 2.7 GHz 
processor and 16GB of memory. Results and column selection 
runtimes at the largest sample sizes are shown for full matrices 
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Fig. 6. Nystrom Approximation Error Curves and Run Time Comparison. See Sections |V-A| and |V-B | for details. oASIS is accurate and scales well to large 
problem sizes due to its low runtime complexity (see Section |lV-B). 


in Table |T] oASIS is competitive with the most accurate 
adaptive schemes, at a fraction of the runtime. Figure [6] shows 
selected convergence curves (normalized error vs. number of 
columns sampled). Figure [6] also presents a plot of column 
selection runtime vs. matrix size for a variety of methods. Fig¬ 
ure [7] shows convergence curves (normalized error vs. number 
of seconds runtime) and column sampling rates (number of 
columns sampled vs. number of seconds runtime) for all 
adaptive methods using the Gaussian kernel. These curves 
allow for a fair assessment of approximation error achieved 
after a set run time for various adaptive methods, as methods 
will select columns at different rates. See Section IV-EI for a 
full discussion of these results. 

a) Two Moons: We consider a common synthetic dataset 
for clustering algorithms that consists of 2-dimensional points 
arranged in two interlocking moons. This set is 2,000 points of 
dimension 2. We set the kernel a equal to 5% of the maximum 
Euclidean distance between points. 

b) Abalone: The Abalone dataset is a collection of 
physical characteristics of abalone |33) , which are analyzed 
to find the age of the abalone without direct measurement. 
This set is 4,177 points of dimension 8. We set the kernel 
a equal to 5% of the maximum Euclidean distance between 
points. 

c) Binary Organization of Random Gaussians (BORG): 
The BORG assimilates sets of points clustered tightly around 
each vertex of an n-dimensional cube. The points around each 
vertex v are distributed as A ct R q RG I) with cr RORG = 0.1. 
This dataset is constructed to be pathologically difficult, with 
many clusters and many points per cluster. Many columns are 
needed from G to ensure sampling from each cluster. Using 
an 8 dimensional cube with 30 points at each vertex, the total 


dataset is 7,680 points of dimension 8. We set the kernel a 
equal to 12.5% of the maximum Euclidean distance between 
points. 

C. Implicit Kernel Matrices 

Here, we consider datasets for which the resulting kernel 
matrix would become impractical to explicitly calculate and 
store. Rather, columns from G are generated “on the fly” at 
the time they are sampled. Since a full representation of G 
is no longer available, we estimate the approximation error as 
the Frobenius-norm discrepancy between 100,000 randomly 
sampled entries in G and the corresponding entries in G k- 
Because the Leverage scores and Farahat m schemes 
require a full representation of G (which is intractable for these 
problem instances), we compare only with uniform random 
sampling and I\ -means. We consider the following datasets, 
run using MATLAB on an iMac with a 2.7 GHz processor 
and 16GB of memory. Results and column selection runtimes 
at the largest sample sizes are shown for implicit matrices in 
Table HD 

d) MNIST: The MNIST dataset consists of handwritten 
digits used as a benchmark test for classification |34) . MNIST 
training data contains 50,000 images of 28 x 28 = 784 pixels 
each. Similarity matrices formed from the digits are known to 
have low-rank structure, because there are only 10 different 
numerical digits. We set the kernel a equal to 50% of the 
maximum Euclidean distance between points. 

e) Salinas: The Salinas dataset is a hyperspectral im¬ 
age taken in 1998 by the Airborne Visible/Infrared Imaging 
Spectrometer (AVIRIS) over Salinas Valley, CA. The image is 
of size 512 x 217, over 204 spectral bands, and can be used 
to classify various areas of crops. Each pixel is assigned to 
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Fig. 7. Nystrom Error curves and number of samples based on time. See Section |V-B| for details. Note that oASIS is accurate and scales well to large 
problem sizes due to its low runtime complexity (see Section |lV-B). 


TABLE I 

Error and Selection Runtime Results for Explicit Gaussian (first line) and Diffusion (second line) Kernel Matrices. 


Problem 

n 

l 

oASIS 

Random 

Leverage scores 

iC-means 

Farahat 

Two Moons 

2,000 

450 

1.00e-6 (1.20) 
1.10e-6 (1.16) 

2.14e—3 (0.01) 
1.22e—2 (0.01) 

9.46e—4 (3.96) 
7.45e—3 (4.00) 

1.05e—3 (0.38) 
5.49e—3 (1.21) 

8.31e—7 (19.7) 
l.lle-6 (19.6) 

Abalone 

4,177 

450 

1.23e—6 (2.60) 
1.62e—6 (2.51) 

2.65e—3 (0.01) 
3.76e—1 (0.01) 

5.23e—4 (35.8) 
1.47e—1 (35.9) 

1.73e—3 (0.84) 
3.24e—1 (8.26) 

2.85e—7 (64.8) 
5.61e—7 (64.7) 

BORG 

7,680 

450 

5.30e—2 (4.71) 
6.29e—2 (4.73) 

3.90e—1 (0.01) 
3.90e—1 (0.01) 

4.31e—1 (252) 
4.23e—1 (244) 

8.89e—2 (2.53) 
7.70e—2 (48.3) 

2.75e—2 (176) 
2.78e—2 (174) 


TABLE II 

Error and Selection Runtime Results for Implicit Kernel Matrices. 


Problem 

n 

t 

oASIS 

Random 

iC-means 

MNIST 

50,000 

4,000 

1.53e—6 (8260) 

7.48e—5 (946) 

7.30e—7 (188) 

Salinas 

54,129 

5,000 

2.22e—5 (13372) 

1.61e—4 (819) 

1.33e—5 (1120) 

Light Field 

82,265 

4,000 

7.10e—6 (13600) 

2.54e—5 (989) 

9.44e—6 (1890) 
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TABLE III 

Error and Factorization Runtime Results for Parallelized Implicit Kernel Matrices. 


Problem 

n 

£ 

oASIS-P 

Random 

Two Moons 

1,000,000 

1,000 

5.10e—6 (108) 

5.90e—4 (279) 

Tiny Images 

1,000,000 

4,500 

2.76e—4 (10672) 

2.99e—6 (28225) 

Tiny Images 

4,000,000 

4,500 

5.20e—4 (14013) 

1.70e—5 (26566) 


one of 16 classes according to a ground truth image. Classes 
represent crops such as broccoli or lettuce. We consider all 
pixels assigned to a nonzero class, for a total number of 54,129 
data points. We set the kernel a equal to 10. 

f) Light Field: Light fields are 4-dimensional datasets 
describing both the intensity and directionality of light as it 
travels through a plane. We consider patches taken from the 
“chessboard” dataset of the Stanford Multi-Camera Array (35). 
The samples are 85,265 vectorized 4-dimensional “patches,” 
each with 4x4 spatial resolution and 5x5 angular resolution 
for a total dimensionality of 400. We set the kernel <r equal 
to 50% of the maximum Euclidean distance between points. 


D. Implicit Kernel Matrices with oASIS-P 


Finally, we consider datasets that cannot be fit in memory. 
As such, the dataset is split onto a variety of nodes and the 
kernel matrix is approximated using oASIS-P as described in 
Figure [4] At this size, we compare only with uniform random 
sampling. We choose a Gaussian kernel for all datasets in this 
class. 

Since a full representation of G is no longer available, 
we estimate the approximation error as the Frobenius-norm 
discrepancy between 100,000 randomly sampled entries in 
G and the corresponding entries in Gk . We consider the 
following datasets, run using OpenMPI with the Eigen C++ 
library (36) over 16 nodes (192 cores) on the DaVinCi cluster 
at Rice University, with each 2.83 GHz processor core having 
4GB of memory. Results at the largest sample sizes are shown 


for parallelized implicit matrices in Table III These sample 
sizes were chosen at the limit of available run time on the 
cluster when using uniform random sampling to both sample 
and form columns. 

g) Two Moons: This dataset is as described in Section 


V-B but the number of data points has been increased to 
1,000,000 points. Determining a good kernel er from the maxi¬ 
mum Euclidean distance among all points becomes intractable, 
and so we found a kernel a of 0.5 x \/3 that provided 
good approximations for all sampling methods at smaller trial 
datasets of Two Moons. This kernel a was then used for 
the full set. For this experiment, oASIS was mn to an error 
tolerance of le—4, and random sampling was performed for 
k e {20, 50,100, 200, 250,500,1000} samples. 

Ii) Millions of Tiny Images: To show the capability of 
oASIS to approximate kernel matrices over very large datasets, 
we select two random subsets of the 80 Million Tiny Images 
dataset (37), consisting of 1,000,000 and 4,000,000 RGB 
images of size 32 x 32. For reference, storing the full kernel 


matrix for 4,000,000 tiny images in binary64 would take up 
128,000 gigabytes of space. 

We compute the approximation over one color channel of 
the images as we consider the number of data points to be the 
prime focus of this experiment. Determining a good kernel 
a from the maximum Euclidean distance among all points 
becomes intractable, and so we found a kernel a of 20 that 
provided good approximations for all sampling methods at 
smaller trial datasets of Tiny Images. This kernel a was then 
used for the full set. 


E. Discussion of Results 

As shown in in Table [IJ oASIS achieves lower approxima¬ 
tion error than both uniform random sampling and Leverage 
scores for full kernel matrix experiments when given a set 
number of columns. In addition, it is competitive with both 
A'-means Nystrom and Farahat’s method in terms of accuracy 
while having substantially faster run times. 

In addition, oASIS’s strength as a deterministic method 
enables oASIS to run for a set length of time, as opposed to a 
fixed £. When running the experiments in Figure[7]for adaptive 
random schemes, it was not known a priori how many columns 
either A'-means or Leverage scores should sample. One must 
guess at the appropriate K or £ to use given a certain amount 
of time. Our experiments found the appropriate parameters 
through exhaustive search, resetting the clock and increasing 
£ for each trial until the time limit was reached. 

The primary advantage of oASIS over random selection 
schemes is its approximation accuracy. In the Abalone exam¬ 
ple in Figure [6] uniform random sampling does not provide 
better accuracy as more columns are sampled, while oASIS 
continues to find columns that can add significantly to the 
accuracy of the approximation. In Figure |7J we observe that 
oASIS achieves exact matrix recovery with Two Moons at 
about 30 seconds with around 1000 columns sampled. This 
is an example of the efficiency of oASIS’s column sampling. 
This efficient accuracy is necessary for the subsequent dimen¬ 
sionality reduction critical to most kernel machine learning 
tasks. We further observe that the error with A'-means flattens 
after 5 seconds. A'-means Nystrom first clusters the original 
data and computes centroids. It then remaps the data onto the 
eigenspace of the centroids, and then computes an approximate 
kernel matrix from this remapping. As such, there is a floor 
for the approximation based on the accuracy of the eigenspace 
calculation. This results in a best possible error for AT-means 
that can be overcome by oASIS. We observe a similar, flat 
error result for Leverage scores, as the SVD of the entire G 
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can be performed within the runtime limit. This does not occur 
when the datasets grow larger. 

The primary advantage of oASIS over other adaptive 
schemes is its efficiency. oASIS saves both time and space. 
In the runtime results shown in Figure [7] we observe fast, 
accurate approximation of both Two Moons and Abalone with 
fewer columns sampled. For the BORG dataset, oASIS is 
second only to A'-means, which is as expected given that 
BORG’s dataset containing of spherical clusters of equal 
variance exactly fits the inherent data model that AT-means 
clusters. When the data do not fit that model, as in Abalone 
or Two Moons, we can see the efficiency and accuracy gains 
of oASIS. We observe in Table [I] that while A"-means Nystrom 
runs faster for a single sample size t, it needs to be run multiple 
times for consistency. For example, while running a single 
A'-means Nystrom approximation on Two Moons Gaussian 
with n = 2000 and l = 450 takes 0.38 seconds, the 10 
runs used for consistency takes 3.8 seconds. Furthermore, A'- 
means approximations performed for l samples provides no 
remapping for any samples fewer than /:, nor an index set A 
of columns for CSS. 

The advantages of oASIS-P over uniform random selection 
become clear in its application. First, oASIS-P can guarantee 
an invertible W. Uniform random sampling can not guarantee 
that W will be invertible, and so we must calculate W' 
to compute G. Indeed, preliminary experiments frequently 
showed Wk to be rank-deficient. This is most likely due to 
the “birthday problem” - as more columns are selected, the 
chances that any two columns are of the same direction grows 
surprisingly fast. As oASIS can iteratively compute W ~ 1 , it 
does not need to invert an £ x £ matrix as a separate step. 
Note that 1% of a 1,000,000 point dataset still results in a 
10,000 x 10,000 W matrix. Each of DaVinCi’s cores had 4GB 
of memory, and uniform random sampling became infeasible 
after approximately 4,500 columns as computing became 
too memory intensive for a single node, and distributing the 
computation of is not straightforward. Second, while the 
complexity of oASIS-P appears much higher than uniform 
random sampling, in practice oASIS-P is faster than uniform 
random sampling at sampling and forming £ columns. Col¬ 
umn generation takes the same amount of time regardless 
of the column selection scheme, and in large data regimes 
communication of data vectors between nodes becomes the 
computational bottleneck. Computing an iterative W~ x is 
faster than computing Wf, and so for very large data regimes 
oASIS-P becomes faster than uniform random sampling, as 
shown in Table III For example, oASIS-P samples and forms 
columns over the Two Moons dataset in less than half of the 
time it takes for uniform random sampling to perform the same 
task. 

In addition to its low runtime complexity, oASIS is also 
capable of benefiting from sparse matrix structure. For such 
matrices, adaptive methods like the one in [ 121 requires the 
computation of n x n “residuals,” which may be dense even 
in the case that G is extremely sparse. In contrast, oASIS 
requires only the storage of much smaller i x n matrices. This 
benefit of oASIS is highly relevant for extremely large datasets 
where sparse approximations to similarity matrices are formed 


using A'-nearest-neighbor algorithms that only store the most 
significant entries in each matrix column. Further analysis is 
necessary, however, as fast approximation methods have been 
developed specifically for sparse kernel matrices ]38) . 

VI. Conclusion 

In this paper, we have introduced oASIS, a novel adaptive 
sampling algorithm for Nystrom based low-rank approxima¬ 
tion. oASIS combines the high accuracy of adaptive matrix 
approximation with the low runtime complexity and memory 
requirements of inexpensive random sampling. oASIS achieves 
exact matrix recovery in an optimal number of columns. 
We have demonstrated the speed and efficacy of the method 
by accurately approximating large matrices using a single 
processor. In addition, we have parallelized oASIS so it could 
be run over datasets of arbitrary size. This allows oASIS to 
be the only adaptive greedy method available in large data 
regimes. In addition, our numerical experiments show oASIS 
to be competitive with random schemes at this level, in both 
accuracy and speed. Using a dataset of 1,000,000 examples we 
are able to achieve 1% of the approximation error of random 
sampling methods. oASIS has been applied to sparse subspace 
clustering a, and future work will focus on applying oASIS 
to other machine learning tasks, such as manifold learning and 
spectral clustering. 
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